Reconstruction of 3D Models from Object Silhouettes¶

The process of reconstructing 3D models can be broadly categorized into two classes of algorithms. The first class computes disparity and depth maps from multiple views of an object, which are then registered into a single 3D model. This tutorial focuses on the second class of reconstruction algorithms, which adopt a volumetric representation of the scene. This approach is commonly referred to as 'Shape-from-Silhouette'.

To illustrate, consider the video below, where an object is captured from 18 different angles, providing the data for constructing our 3D model. In its simplest form, volumetric-based 3D reconstruction proceeds as follows: first, the space encompassing the object to be reconstructed is divided into small cubic volumes called voxels. Each voxel is projected onto image coordinates. If the projected voxel falls within the contour in every view, it is considered part of the object; if it falls outside the contour in any view, it is discarded.

Now, let's dive into the details of shape-from-silhouette.

RULES: As usual, OpenCV is banned in this repository.

Acknowledgment: Please note that this tutorial will follow the methodology used in "Reconstruction of Volumetric 3D Models" by Peter Eisert, allowing readers to easily reference the source for additional information when needed. Additionally, the majority of the figures have been adapted from the same paper.


A statue is captured from 18 different angles.

In [ ]:
import numpy as np
from pathlib import Path
from natsort import natsorted
import matplotlib.patches as patches
import matplotlib.pyplot as plt
from skimage import io

Introduction¶

The fundamental concept of shape-from-silhouette is that any object must fit entirely within its contour. When an object is observed from a known viewpoint under perspective projection, the rays originating at the focal point and passing through the silhouette contour form the hull of a viewing cone. This viewing cone, depicted on the left-hand side of the figure below, sets an upper bound on the object's shape: the actual volume of the object is guaranteed to be less than or equal to this rough approximation. Importantly, no assumptions are made about the viewing position beyond the camera calibration data.

For any given viewing position, the silhouette defines a viewing cone that fully encompasses the object. As the object's volume is constrained by all the individual cones, it is constrained to reside within the intersection of these viewing cones. The right-hand side of the figure below illustrates this concept, where only points in the 3D space that lie inside all viewing cones are considered part of the object to be reconstructed.

However, it's important to note that shape-from-silhouette methods have limitations. They cannot reconstruct all possible shapes. For instance, consider the concavity shown on the right-hand side of the figure below – this feature is never visible in the silhouette and, consequently, cannot be recovered. Rather than obtaining the true shape, the method can only provide an estimation of the visual hull.
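The intersection idea reduces to a one-line boolean operation: a candidate point belongs to the visual hull only if every view's silhouette contains it. A minimal sketch with synthetic per-view occupancy arrays (the arrays are made up for illustration):

```python
import numpy as np

# Synthetic example: three views, each a boolean occupancy grid over the
# same 8 candidate points (True = inside that view's silhouette).
view_a = np.array([0, 1, 1, 1, 1, 1, 0, 0], dtype=bool)
view_b = np.array([0, 0, 1, 1, 1, 1, 1, 0], dtype=bool)
view_c = np.array([0, 1, 1, 1, 1, 0, 0, 0], dtype=bool)

# The visual hull is the logical AND across all views: a point survives
# only if every silhouette contains it.
hull = view_a & view_b & view_c
print(hull.astype(int))  # -> [0 0 1 1 1 0 0 0]
```

The voxel-based reconstruction later in this tutorial is exactly this test, carried out over a 3D grid of points projected into each camera.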


Left: Viewing cone through the image silhouette containing the object. Right: Intersection of multiple viewing cones.

Thus, the entire procedure of shape-from-silhouette can be summarized in the following steps:

  • Calibration of the cameras to determine their position, orientation, and intrinsic camera parameters.
  • Segmentation of the object from the background in the captured images to derive the object's contour.
  • Estimation of the bounding volume in 3D world coordinates.
  • Projecting voxels into image coordinates and accumulating valid voxels.

Camera Calibration and Camera Projection Matrix¶

The camera projection matrix is derived from the process of camera calibration, which entails extracting pairs of points in 3D world coordinates and their corresponding 2D image coordinates. Fortunately, this information has been provided in this example, allowing us to simply read the camera projection matrix as shown below. images contains 18 grayscale images, and camera_matrices holds the associated camera projection matrices for these 18 images.

Note: If you want to know more about camera calibration, please visit this tutorial.
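To make the projection step concrete, here is a small sketch (with a made-up 3x4 matrix; the real matrices are loaded from the *.pa files below) of how a homogeneous 3D world point is mapped to pixel coordinates:

```python
import numpy as np

# Hypothetical 3x4 camera projection matrix (values invented for
# illustration only).
P = np.array([
    [800.0,   0.0, 320.0, 0.0],
    [  0.0, 800.0, 240.0, 0.0],
    [  0.0,   0.0,   1.0, 0.0],
])

# A 3D world point in homogeneous coordinates (X, Y, Z, 1).
X = np.array([0.5, -0.25, 2.0, 1.0])

# Project, then divide by the third (depth) component to obtain pixels.
x = P @ X
u, v = x[0] / x[2], x[1] / x[2]
print(u, v)  # -> 520.0 140.0
```

The same multiply-then-normalize pattern appears throughout the notebook whenever voxels are projected into a view.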

In [ ]:
data_folder = Path("./input/")
images = []
camera_matrices = []

for img_path, mat_path in zip(
    natsorted(data_folder.glob("*.jpg")),
    natsorted(data_folder.glob("*.pa")),
):
    assert img_path.stem == mat_path.stem
    img = io.imread(img_path, as_gray=True)  # skimage returns floats in [0, 1]
    img = np.clip(img * 255.0, 0, 255).astype(np.uint8)  # rescale to 8-bit
    assert img.ndim == 2
    images.append(img)
    camera_matrices.append(np.genfromtxt(mat_path))

assert images[0].shape == (480, 640) and camera_matrices[0].shape == (3, 4)
num_views = len(images)
print(
    f"We first convert images from RGB to grayscale.\n"
    + f"Each image's size is {images[0].shape} and its range is from {np.min(img)} to {np.max(img)}.\n"
    + f"Finally, there is an associated {camera_matrices[0].shape} projection matrix for each image.\n"
    + f"There are {num_views} views."
)
We first convert images from RGB to grayscale.
Each image's size is (480, 640) and its range is from 0 to 255.
Finally, there is an associated (3, 4) projection matrix for each image.
There are 18 views.

Segmentation of the Object from the Background¶

Now, our objective is to generate silhouettes for the statue in these 18 images, effectively segmenting the statue from the background. The initial step involves creating binary silhouette images from the original images. Once again, we are fortunate in this example: with the white statue against a black background, we can apply a simple intensity threshold, threshold, to create a binary mask, mask. In this mask, statue pixels are 1 and background pixels are 0. Consequently, the silhouettes variable is a list containing 18 binary masks.

In [ ]:
threshold = 101

num_col, num_row = 6, 3
fig, ax = plt.subplots(nrows=num_row, ncols=num_col, figsize=(33, 13))
silhouettes = []
for row in range(num_row):
    for col in range(num_col):
        mask = images[row * num_col + col] >= threshold
        silhouettes.append(mask)
        ax[row, col].imshow(mask, cmap="gray")
        ax[row, col].set_axis_off()
plt.tight_layout()

Estimation of the Bounding Volume in 3D World Coordinates¶

In the following steps, our objective is to estimate the bounding volume in 3D world coordinates, which signifies the space we intend to project into 2D image coordinates. This process involves working with three distinct sets of coordinates:

  • 2D image coordinates: These are the coordinates in the image itself.
  • 3D world coordinates: Representing the 3D space in the real world.
  • 3D volume coordinates: Referring to coordinates within the volumetric space.

Our primary objective is to project the position of a voxel from 3D volume coordinates to 3D world coordinates. This transformation allows us to locate the voxel in the real-world 3D space. Subsequently, we project this 3D world coordinate into 2D image coordinates, as illustrated in the image below.

The projection matrix from 3D world coordinates to 2D image coordinates is already provided as part of the assignment. Our current task is to determine how to construct the projection matrix for transforming 3D volume coordinates into 3D world coordinates. This transformation is accomplished through the volume_to_world function. It's worth noting that the y-axis undergoes a flip during this process.


Representation of a scene by a 3D array of volume elements - voxels.

After conducting experiments, I've determined that the 3D volume coordinates can be defined as follows in Python:

volume_x = 64
volume_y = 64
volume_z = 256

These values set the dimensions for the 3D volume, with volume_x, volume_y, and volume_z representing the size in the x, y, and z dimensions, respectively.

Additionally, the 3D world coordinates can be represented using a 2x3 NumPy array, bbox_3d, as follows:

bbox_3d = np.array([[0.3, -0.1, -1.9], [2.1, 1.3, 2.5]])

In this 2x3 array, each column holds the starting and ending points of one axis: the x-axis spans from 0.3 to 2.1, the y-axis from -0.1 to 1.3, and the z-axis from -1.9 to 2.5.
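As a quick sanity check of these numbers (a side calculation, not part of the original pipeline), the physical size of one voxel along each axis is simply the bounding-box extent divided by the voxel count:

```python
import numpy as np

bbox_3d = np.array([[0.3, -0.1, -1.9], [2.1, 1.3, 2.5]])
volume_dims = np.array([64, 64, 256])  # voxels along x, y, z

# Per-axis voxel size: extent / voxel count.
voxel_size = (bbox_3d[1] - bbox_3d[0]) / volume_dims
print(voxel_size)  # roughly 0.0281, 0.0219, 0.0172
```

These per-axis resolutions are exactly what the volume_to_world function below computes as res_x, res_y, and res_z.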

We proceed by creating volume_3d_corners, which represents the eight corners of the 3D volume in volume coordinates. Our goal is to project each of these corners into image coordinates. However, it's important to note that this projection results in the loss of depth information in the image coordinates. As a result, we obtain eight corresponding points in 2D image coordinates. To determine the smallest bounding box that encompasses these eight points, we utilize the get_2D_bbox function.

As illustrated in the graph below, a 2D bounding box is depicted on each 2D image, and each of these bounding boxes effectively encapsulates the entire statue.

In [ ]:
def volume_to_world(bbox, x_length, y_length, z_length):
    """Build the 4x4 matrix mapping 3D volume coordinates to 3D world
    coordinates: scale voxel indices by the per-axis voxel size, translate
    to the bounding-box origin, then flip the y-axis."""
    # Per-axis voxel size: bounding-box extent divided by the voxel count.
    res_x = (bbox[1, 0] - bbox[0, 0]) / x_length
    res_y = (bbox[1, 1] - bbox[0, 1]) / y_length
    res_z = (bbox[1, 2] - bbox[0, 2]) / z_length
    # Translation to the starting corner of the bounding box.
    translate = np.array(
        [
            [1, 0, 0, bbox[0, 0]],
            [0, 1, 0, bbox[0, 1]],
            [0, 0, 1, bbox[0, 2]],
            [0, 0, 0, 1],
        ]
    )
    v_to_w = np.dot(translate, np.diag([res_x, res_y, res_z, 1]))
    # Swap the y- and z-axes with a sign change: (x, y, z) -> (x, z, -y).
    flip_y = np.array([[1, 0, 0, 0], [0, 0, 1, 0], [0, -1, 0, 0], [0, 0, 0, 1]])
    return np.dot(flip_y, v_to_w)


def get_2D_bbox(normalized_2d_pts):
    # Axis-aligned bounding box of the projected points: rows 0 and 1 hold
    # the x- and y-coordinates, respectively.
    min_x, min_y = np.min(normalized_2d_pts[:2], axis=1)
    max_x, max_y = np.max(normalized_2d_pts[:2], axis=1)
    return min_x, min_y, max_x, max_y


bbox_3d = np.array([[0.3, -0.1, -1.9], [2.1, 1.3, 2.5]])
volume_x = 64
volume_y = 64
volume_z = 256
proj_volume2world = volume_to_world(bbox_3d, volume_x, volume_y, volume_z)
volume_3d_corners = np.array(
    [
        [0, 0, 0, 1],
        [0, 0, volume_z, 1],
        [0, volume_y, 0, 1],
        [0, volume_y, volume_z, 1],
        [volume_x, 0, 0, 1],
        [volume_x, 0, volume_z, 1],
        [volume_x, volume_y, 0, 1],
        [volume_x, volume_y, volume_z, 1],
    ]
).T  # (4, 8)

fig, ax = plt.subplots(nrows=num_row, ncols=num_col, figsize=(33, 13))
for row in range(num_row):
    for col in range(num_col):
        cam = camera_matrices[row * num_col + col]
        ax[row, col].imshow(images[row * num_col + col], cmap="gray")
        bbox_2d = np.dot(cam, np.dot(proj_volume2world, volume_3d_corners))
        normalized_pts = np.divide(bbox_2d, bbox_2d[2, :])
        min_x, min_y, max_x, max_y = get_2D_bbox(normalized_pts)
        x, y, w, h = min_x, min_y, max_x - min_x, max_y - min_y
        bounding_box = patches.Rectangle(
            (x, y), w, h, linewidth=1, edgecolor="y", facecolor="none"
        )
        ax[row, col].add_patch(bounding_box)  # Add the bounding box to the plot
        ax[row, col].set_axis_off()
plt.tight_layout()
plt.show()
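Before moving on, the y-axis flip mentioned earlier can be checked in isolation. The flip matrix used in volume_to_world (reproduced here so the snippet is self-contained) maps (x, y, z) to (x, z, -y):

```python
import numpy as np

# The axis-flip matrix from volume_to_world: row 1 picks out z,
# row 2 picks out -y, so (x, y, z) -> (x, z, -y).
flip_y = np.array([[1, 0, 0, 0],
                   [0, 0, 1, 0],
                   [0, -1, 0, 0],
                   [0, 0, 0, 1]])

p = np.array([2.0, 3.0, 5.0, 1.0])  # homogeneous point (x, y, z, 1)
print(flip_y @ p)  # x stays 2, z = 5 moves to y, y = 3 becomes -3
```

This is why the reconstructed volume appears upright even though volume indices and world coordinates use different axis conventions.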

Voxel Projection and Accumulation¶

After establishing the 3D volume in world coordinates, the next critical step is projecting the voxels into 2D image coordinates. For each voxel with a known 3D position, we compute its projection onto each view's image coordinates; if the projected point falls within that view's silhouette, the voxel receives one vote. After processing all views, we obtain a 3D array of vote counts. Voxels with exactly 18 votes (one per view) lie inside every silhouette and are recognized as part of the statue; all other voxels are discarded.

In [ ]:
volume_votes = np.zeros((volume_x, volume_y, volume_z))

volume_coordinates = []
for i in range(volume_x):
    for j in range(volume_y):
        for k in range(volume_z):
            volume_coordinates.append((i, j, k, 1))
volume_coordinates = np.stack(volume_coordinates, axis=0).T
assert volume_votes.shape == (volume_x, volume_y, volume_z)
assert volume_coordinates.shape == (4, volume_x * volume_y * volume_z)

for cam, mask in zip(camera_matrices, silhouettes):
    _, num_points = volume_coordinates.shape
    bbox_2d = np.dot(cam, np.dot(proj_volume2world, volume_coordinates))
    normalized = np.divide(bbox_2d, bbox_2d[2, :])
    height, width = mask.shape
    for i in range(num_points):
        u, v = int(normalized[0, i]), int(normalized[1, i])
        # Skip projections that land outside the image bounds.
        if 0 <= u < width and 0 <= v < height and mask[v, u]:
            volume_votes[
                volume_coordinates[0, i],
                volume_coordinates[1, i],
                volume_coordinates[2, i],
            ] += 1
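As an aside, the per-voxel loop above is easy to follow but slow in pure Python; the same vote pass can be vectorized with NumPy fancy indexing. The sketch below uses a hypothetical carve_votes helper, exercised with a toy orthographic-style camera and mask (all made up for illustration), not the dataset's real matrices:

```python
import numpy as np

def carve_votes(volume_dims, proj, mask):
    # Project every voxel at once and look the results up in the silhouette
    # mask via fancy indexing. `proj` is a full 3x4 volume-to-image
    # projection; `mask` is a boolean silhouette image.
    nx, ny, nz = volume_dims
    ii, jj, kk = np.meshgrid(
        np.arange(nx), np.arange(ny), np.arange(nz), indexing="ij"
    )
    pts = np.stack(
        [ii.ravel(), jj.ravel(), kk.ravel(), np.ones(nx * ny * nz)]
    )  # (4, N) homogeneous voxel coordinates
    uvw = proj @ pts
    u = (uvw[0] / uvw[2]).astype(int)
    v = (uvw[1] / uvw[2]).astype(int)
    h, w = mask.shape
    inside = (u >= 0) & (u < w) & (v >= 0) & (v < h)  # clip to the image
    votes = np.zeros(nx * ny * nz)
    votes[inside] = mask[v[inside], u[inside]]
    return votes.reshape(nx, ny, nz)

# Toy camera mapping voxel (i, j, k) straight to pixel (u, v) = (i, j),
# and a 4x4 mask whose top-left 2x2 block is "object".
proj = np.array([[1.0, 0, 0, 0], [0, 1.0, 0, 0], [0, 0, 0, 1.0]])
mask = np.zeros((4, 4), dtype=bool)
mask[:2, :2] = True
votes = carve_votes((4, 4, 2), proj, mask)
print(votes[:, :, 0])  # the top-left 2x2 block gets one vote each
```

Note the inside mask, which guards against projections that fall outside the image, mirroring the bounds check needed in the loop version.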

Visualization of the Accumulated Voxels Array¶

The last step is to visualize the accumulated voxel array in a rotating 3D view. To achieve this, we render a view for each rotation angle and save it as an individual frame. While creating a full 3D animation is an option, here we simply generate one static view per rotation angle.

In [ ]:
voxel_array = volume_votes == num_views
assert voxel_array.shape == (volume_x, volume_y, volume_z)

rotation_speed_deg_per_sec = 20  # Define the rotation parameters

# Create images for each frame
for frame in range(num_views):
    # Create a fresh figure for each frame; reusing one across frames leaks memory.
    fig = plt.figure()
    ax = fig.add_subplot(111, projection="3d")
    ax.set_axis_off()
    ax.set_box_aspect((volume_x, volume_y, volume_z))
    ax.set_facecolor("silver")  # Set the background color

    angle_deg = frame * rotation_speed_deg_per_sec
    filename = f"./output/david_3d_{frame}.png"
    print(filename)
    ax.voxels(voxel_array, facecolors="white", edgecolor="k", linewidth=0.1)
    ax.view_init(
        elev=20.0, azim=int((270 + angle_deg) % 360)
    )  # Sync the angle with the input "david_**.jpg"

    # Save the figure
    plt.savefig(filename, format="png", bbox_inches="tight", pad_inches=0, dpi=200)

    # Close the figure
    plt.close(fig)
    # break
./output/david_3d_0.png
./output/david_3d_1.png
./output/david_3d_2.png
./output/david_3d_3.png
./output/david_3d_4.png
./output/david_3d_5.png
./output/david_3d_6.png
./output/david_3d_7.png
./output/david_3d_8.png
./output/david_3d_9.png
./output/david_3d_10.png
./output/david_3d_11.png
./output/david_3d_12.png
./output/david_3d_13.png
./output/david_3d_14.png
./output/david_3d_15.png
./output/david_3d_16.png
./output/david_3d_17.png

Reference:

  • "Reconstruction of Volumetric 3D Models" by Peter Eisert.
  • "Multi-hypothesis, volumetric reconstruction of 3-D objects from multiple calibrated camera views" by Peter Eisert, Eckehard Steinbach, and Bernd Girod.
  • "Automatic 3D Model Construction for Turn-Table Sequences" by Andrew W. Fitzgibbon, Geoff Cross, and Andrew Zisserman.
  • Animate a rotating 3D graph in matplotlib
  • How to set "camera position" for 3D plots using python/matplotlib?